KCAT: A Korean Corpus Annotating Tool Minimizing Human Intervention

Authors

  • Won-Ho Ryu
  • Jin-Dong Kim
  • Hae-Chang Rim
  • Heui-Seok Lim
Abstract

While large POS (part-of-speech) annotated corpora play an important role in natural language processing, such corpora require very high accuracy and consistency. To build an accurate and consistent corpus, we often use a manual tagging method, but manual tagging is labor-intensive and expensive. Furthermore, it is not easy to get consistent results from human experts. In this paper, we present an efficient tool for building large, accurate, and consistent corpora with minimal human labor. The proposed tool supports semiautomatic tagging. Using disambiguation rules acquired from human experts, it minimizes human intervention in both the manual tagging and post-editing steps.


Similar resources

NTU-MC Toolkit: Annotating a Linguistically Diverse Corpus

The NTU-MC Toolkit is a compilation of tools to annotate the Nanyang Technological University Multilingual Corpus (NTU-MC). The NTU-MC is a parallel corpora of linguistically diverse languages (Arabic, English, Indonesian, Japanese, Korean, Mandarin Chinese, Thai and Vietnamese). The NTU-MC thrives on the mantra of "more data is better data and more annotation is better information". Other than...


TESLA: A Tool for Annotating Geospatial Language Corpora

In this paper, we present The gEoSpatial Language Annotator (TESLA)—a tool which supports human annotation of geospatial language corpora. TESLA interfaces with a GIS database for annotating grounded geospatial entities and uses Google Earth for visualization of both entity search results and evolving object and speaker position from GPS tracks. We also discuss a current annotation effort using...


Annotating Korean Demonstratives

This paper presents preliminary work on a corpus-based study of Korean demonstratives. Through the development of an annotation scheme and the use of spoken and written corpora, we aim to determine different functions of demonstratives and to examine their distributional properties. Our corpus study adopts similar features of annotation used in Botley and McEnery (2001) and provides some lingui...


Arabic anaphora resolution: corpora annotation with coreferential links

Annotated resources are much needed for the evaluation and training of anaphora resolution systems. Coreferential chain annotation is a difficult task which cannot be realised without an appropriate tool. In this paper, we present our work on Arabic corpora annotation with anaphoric links (i.e., the annotation of the identity relation between the anaphors and their antecedents). In particular,...


Corpus building for Mongolian language

This paper presents ongoing research aimed at building the first corpus, of 5 million words, for the Mongolian language, focusing on annotating and tagging corpus texts according to the TEI XML (McQueen, 2004) format. A tool, MCBuilder, which supports flexible manual annotation and manipulation of corpus texts with XML structure, is also presented.




Publication date: 2000